Predicting Quality of Portuguese Vinho Verde White Wine
Group 009E05
Dongchao Chen, Katie Barton, Kyden Wang, Mason Feng, Xin Mu
The University of Sydney
How to pick a good bottle of white wine?
Introduction
Sample: Portuguese Vinho Verde White Wine
Key measurement of “Good wine”: Quality Assessment
11 predictor variables: physicochemical properties, e.g. density, alcohol, pH
We want to determine which physicochemical properties of white wine are most significant in predicting quality. This has broader implications for consumer purchase decisions as well as company pricing decisions (higher perceived quality supports premium prices and increased profit margins). Although the data set contains both red and white wine, we focus on white wine because of its larger sample size (4898 observations vs 1599). The white wine samples analysed are from the Vinho Verde region of Portugal.
Quality distribution of white wine
Visualise each predictor against the dependent variable
Heat map
Model Selection
Stepwise Selection
LASSO Regression
Ordinal Logistic Regression
Performance Metrics
In order to create our model, we must first choose our predictor variables. We will compare two methods:
Stepwise Selection
Lasso Regression
We will compare these models using \(RMSE\) , \(MAE\) , \(R^2\) and \(AIC\) . From our EDA we noticed that both “alcohol” and “residual.sugar” have high correlations with “density”. This means that we need to be careful of potential multicollinearity and we may want to remove “density” from our model.
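To make these comparison metrics concrete, here is a minimal Python sketch (the analysis itself was presumably done in R; the function name and the Gaussian-likelihood form of AIC, stated up to an additive constant, are our own choices):

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """RMSE, MAE, R^2 and AIC for a fitted regression model.
    k = number of estimated parameters (including the intercept).
    AIC uses the Gaussian log-likelihood, up to an additive constant."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    resid = y_true - y_pred
    rss = np.sum(resid ** 2)
    rmse = np.sqrt(rss / n)
    mae = np.mean(np.abs(resid))
    r2 = 1 - rss / np.sum((y_true - y_true.mean()) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    return rmse, mae, r2, aic
```

RMSE and MAE reward accurate predictions, \(R^2\) measures explained variance, and AIC additionally penalises model size, which is why it is the natural criterion for stepwise selection below.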
Multicollinearity means that our predictors are correlated. We want to reduce this in our model which will be done by checking the Variance Inflation Factor (VIF) of our variables.
Stepwise Selection
After 10-fold CV:
\[\begin{align}
\widehat{\text{quality}} = &154.106 + 0.068(\text{fixed.acidity}) -\\
&1.888(\text{volatile.acidity}) + 0.083(\text{residual.sugar}) +\\
&0.003(\text{free.sulfur.dioxide}) - 154.291(\text{density}) +\\
&0.694(\text{pH}) + 0.629(\text{sulphates}) + 0.193(\text{alcohol})\\
\end{align}\\
\;\\\]
\[\begin{array}{c|cccc}
& \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\
\hline
\textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\
\end{array}
\;\\\]
Forward and backward selection chose the same model! But what about multicollinearity? We can check with VIF.
For this method we compare forward and backward selection, which (as covered in lectures) use AIC as the selection criterion. Let's perform stepwise selection, then 10-fold CV:
Both forward and backward selection chose the same model, which results in identical performance metrics. However, we haven't yet dealt with multicollinearity, which we now check using VIF.
Variable selection using forward/backward selection found that the inclusion of total sulfur dioxide, chlorides (salty) and citric acid (sour) were NOT significant in predicting white wine quality.
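The forward half of the procedure can be sketched as a greedy AIC loop (an illustration only, not the exact implementation behind the reported results; `ols_aic` and `forward_select` are hypothetical helper names):

```python
import numpy as np

def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Xd.shape[1]

def forward_select(X, y, names):
    """Greedy forward selection: repeatedly add the predictor that lowers
    AIC the most; stop when no addition improves AIC.
    Backward selection works analogously, deleting one predictor at a time."""
    chosen, remaining = [], list(range(X.shape[1]))
    best_aic = ols_aic(np.empty((len(y), 0)), y)   # intercept-only model
    while remaining:
        aic, j = min((ols_aic(X[:, chosen + [j]], y), j) for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        chosen.append(j)
        remaining.remove(j)
    return [names[j] for j in chosen], best_aic
```

Because each step only considers one-variable changes, the search is greedy rather than exhaustive, which is one source of the statistical problems noted later.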
Dealing with multicollinearity
\[VIF_i = \frac{1}{1-R^2_i}\]
\[\begin{array}{ccccc}
\textrm{volatile.acidity} & \textrm{free.sulfur.dioxide} & \textrm{pH} & \textrm{sulphates} & \textrm{fixed.acidity}\\
\hline
1.057 & 1.149 & 2.114 & 1.130 & 2.580\\
\end{array}\]
\[\begin{array}{ccc}
\textrm{alcohol} & \textrm{residual.sugar} & \textrm{density}\\
\hline
7.623 & 11.854 & 26.123\\
\end{array}\]
Remove “density”! But what if there was a better way?
VIF is a measure of multicollinearity in the model, where higher values signify higher correlation (which we want to avoid!). The formula is as follows:
Generally we want every variable to have a VIF below 5.
We observe three variables with VIF > 5: “alcohol”, “residual.sugar” and “density”, as expected from our EDA. We can reduce multicollinearity by removing the variables with the highest VIF, but even then stepwise selection suffers from a number of statistical problems. What if there were a better way of performing variable selection?
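The VIF formula translates directly into code: regress each predictor on all the others and invert \(1-R_i^2\). A minimal numpy sketch (`vif` is our own helper, not a library function):

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i of X on all the remaining columns (with an intercept)."""
    X = np.asarray(X, float)
    n, p = X.shape
    vifs = []
    for i in range(p):
        y, others = X[:, i], np.delete(X, i, axis=1)
        Xd = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        rss = np.sum((y - Xd @ beta) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        vifs.append(tss / rss)   # algebraically equal to 1 / (1 - R_i^2)
    return vifs
```

A predictor that is nearly a linear combination of the others (as “density” is of “alcohol” and “residual.sugar” here) produces a very large VIF.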
Lasso Regression
\[\beta^{lasso}_\lambda = \underset{\beta}{\operatorname{\arg\min}} \Biggl\{ \underbrace{\sum_{i=1}^n\Biggl( y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Biggr)^2}_{\text{Residual Sum of Squares}\; (RSS)}+\lambda\sum_{j=1}^p|\beta_j| \Biggr\}\]
Least Absolute Shrinkage and Selection Operator, or LASSO, is a regression method utilising \(\ell_1\)-regularisation, where the parameters of the regression model are:
LASSO is often used when we also need to do variable selection, since it tends to shrink some coefficients to 0. Therefore, we can use LASSO to perform our variable selection and modelling at the same time. We also do not need to worry about multicollinearity as LASSO shrinks the coefficients of these problematic variables to 0.
The main idea of LASSO is that we introduce a small amount of bias into the way we fit our model; in return for that small bias we get a significant drop in variance.
Remembering the LASSO regression formula, we see that we need to choose a suitable hyper-parameter \(\lambda\) , which is done through Cross Validation e.g. 10-fold CV.
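To make the shrinkage concrete, here is a minimal coordinate-descent sketch of the LASSO objective above (illustrative only; the reported results would come from a standard package such as glmnet, and \(\lambda\) would be chosen by 10-fold CV rather than fixed as here):

```python
import numpy as np

def soft_threshold(z, t):
    """The soft-thresholding operator induced by the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO by cyclic coordinate descent, minimising
    (1/2n)||y - Xb||^2 + lam * sum_j |b_j|.
    Assumes the columns of X and y are centred (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_ms = (X ** 2).sum(axis=0) / n          # mean squared norm per column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam) / col_ms[j]
    return b
```

The soft-thresholding step is what sets weak coefficients to exactly zero, which is why LASSO performs variable selection as a side effect of fitting.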
Ordinal Logit Regression
Our dependent variable “quality” is an ordinal variable, so we model the log-odds, also known as the logit. Let's say we have \(J\) ordered categories and work with the cumulative probabilities:
\[ P(Y\leq j) \]
\[\log\Biggl(\frac{P(Y\leq j)}{P(Y> j)}\Biggr) = \text{logit}(P(Y\leq j))\] \[\text{logit}(P(Y\leq j)) = \alpha_j - \sum_{k=1}^{p}\beta_kx_k\]
Ordinal Logit Regression
\[\begin{align}
\text{logit}(P(\widehat{\text{quality}}\leq i)) = &\alpha_i -(- 0.140(\text{fixed.acidity}) -\\
&5.336(\text{volatile.acidity}) + 0.060(\text{residual.sugar}) -\\
&2.809(\text{chlorides}) + 0.011(\text{free.sulfur.dioxide}) +\\
&0.423(\text{pH}) + 1.090(\text{sulphates}) + 0.972(\text{alcohol}))\\
\end{align}\\
\;\\
\alpha_i = (4.035, 6.378, 9.404, 11.959, 14.198, 17.874)\]
Here the \(\alpha_i\) are the cut-points at the boundaries between quality 3 & 4, quality 4 & 5, quality 5 & 6, and so on.
Ordinal Logit Regression Example
For a wine with fixed.acidity = 7, volatile.acidity = 0.27, residual.sugar = 20.7, chlorides = 0.045, free.sulfur.dioxide = 45, pH = 3, sulphates = 0.45 and alcohol = 8.8, and we want to find the probability that this wine has a quality of 6 or less :
\[\begin{align}
\text{logit}(P(\widehat{\text{quality}}\leq 6)) = &11.959 -(- 0.140(\text{fixed.acidity}) -\\
&5.336(\text{volatile.acidity}) + 0.060(\text{residual.sugar}) -\\
&2.809(\text{chlorides}) + 0.011(\text{free.sulfur.dioxide}) +\\
&0.423(\text{pH}) + 1.090(\text{sulphates}) + 0.972(\text{alcohol}))\\
= &11.959 - 9.503 = 2.456\\
P(\widehat{\text{quality}}\leq 6) = &\frac{e^{2.456}}{1+e^{2.456}} = 0.921
\end{align}\\\]
If we want the probability that the wine is of quality exactly 6, we subtract the adjacent cumulative probability, where \(P(\widehat{\text{quality}}\leq 5) = \text{logit}^{-1}(9.404 - 9.503) = 0.475\):
\[\begin{align}
P(\widehat{\text{quality}} = 6) = & P(\widehat{\text{quality}}\leq 6) - P(\widehat{\text{quality}}\leq 5)\\
= &0.921 - 0.475 = 0.446
\end{align}\\\]
So the model gives roughly a \(45\%\) probability that this particular wine has a quality of 6.
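This worked example can be checked by pushing the reported coefficients through the inverse logit (a sketch; per the stated boundary convention, the cut-point 11.959 at the 6 & 7 boundary gives \(P(\text{quality}\leq 6)\) and 9.404 at the 5 & 6 boundary gives \(P(\text{quality}\leq 5)\)):

```python
import math

def invlogit(l):
    """Inverse logit: converts log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-l))

# Fitted coefficients and the example wine from the slides
beta = {"fixed.acidity": -0.140, "volatile.acidity": -5.336,
        "residual.sugar": 0.060, "chlorides": -2.809,
        "free.sulfur.dioxide": 0.011, "pH": 0.423,
        "sulphates": 1.090, "alcohol": 0.972}
wine = {"fixed.acidity": 7.0, "volatile.acidity": 0.27, "residual.sugar": 20.7,
        "chlorides": 0.045, "free.sulfur.dioxide": 45.0, "pH": 3.0,
        "sulphates": 0.45, "alcohol": 8.8}

eta = sum(beta[k] * wine[k] for k in beta)   # linear predictor, ~9.503

p_le6 = invlogit(11.959 - eta)               # P(quality <= 6), 6 & 7 cut-point
p_le5 = invlogit(9.404 - eta)                # P(quality <= 5), 5 & 6 cut-point
p_eq6 = p_le6 - p_le5                        # P(quality = 6)
```

Differencing adjacent cumulative probabilities is how the ordinal model turns its cut-points into a probability for each individual quality score.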
Ordinal Logit Regression
Confusion matrix (rows: actual quality, columns: predicted quality):
\[\begin{array}{c|ccccccc}
\textrm{Actual}\backslash\textrm{Pred.} & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
\hline
3 & 0 & 1 & 7 & 11 & 1 & 0 & 0\\
4 & 0 & 4 & 96 & 62 & 1 & 0 & 0\\
5 & 0 & 2 & 723 & 724 & 8 & 0 & 0\\
6 & 0 & 0 & 377 & 1678 & 143 & 0 & 0\\
7 & 0 & 0 & 52 & 642 & 186 & 0 & 0\\
8 & 0 & 0 & 1 & 117 & 57 & 0 & 0\\
9 & 0 & 0 & 0 & 3 & 2 & 0 & 0\\
\end{array}\\\]
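The headline accuracy can be read straight off this confusion matrix (rows taken as actual quality, since the row totals match the class counts of the 4898 white wines):

```python
import numpy as np

# Confusion matrix from the slide: rows = actual quality 3..9, columns = predicted
cm = np.array([
    [0, 1,   7,   11,   1,  0, 0],
    [0, 4,  96,   62,   1,  0, 0],
    [0, 2, 723,  724,   8,  0, 0],
    [0, 0, 377, 1678, 143,  0, 0],
    [0, 0,  52,  642, 186,  0, 0],
    [0, 0,   1,  117,  57,  0, 0],
    [0, 0,   0,    3,   2,  0, 0],
])
accuracy = np.trace(cm) / cm.sum()   # exact matches / total wines
within_one = sum(int(cm[i, j]) for i in range(7) for j in range(7)
                 if abs(i - j) <= 1) / cm.sum()
```

By this count about 53% of wines are predicted exactly right, and roughly 95% are predicted within one quality point.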
\[\begin{array}{c|cccc}
& \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\
\hline
\textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\
\textrm{LASSO} & 0.751 & 0.584 & 0.281 & 11173.49\\
\textrm{Ordinal} & - & - & 0.312^* & 11001.58\\
\end{array}\]
\(^*\)Pseudo-\(R^2\) for the ordinal model, so not directly comparable with the linear models' \(R^2\).
Conclusions
Best model: Ordinal Logit Regression Model
\[\begin{array}{c|cccc}
& \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\
\hline
\textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\
\textrm{LASSO} & 0.751 & 0.584 & 0.281 & 11173.49\\
\textrm{Ordinal} & - & - & 0.312^* & 11001.58\\
\end{array}\]
OLR predicted the exact quality score correctly about 53% of the time (2591 of the 4898 wines lie on the diagonal of the confusion matrix).
\(\;\)
Trade-off: better predictive accuracy at the cost of interpretability
Conclusions
For consumers: Nutrition facts can be helpful for wine choice.
For companies: attributes linked to higher perceived quality can justify premium pricing and increased profits.
Ordinal logit regression outperformed the stepwise-selection and LASSO models, with a higher (pseudo-)\(R^2\) and a smaller AIC, making it the best-fitting of our candidate models.